Recap

  • Regular expressions
  • Steps for text analysis
  • R packages for text analysis
  • Tidy text
  • Stop words
  • Sentiment of the text
  • Word Importance

Outline

  1. What is cluster analysis?
  2. Distance Measure
  3. K-means algorithm
  4. Hierarchical algorithms
  5. Dendrograms

Cluster Analysis

Cluster Analysis

Cluster analysis is a statistical technique that groups data points into clusters according to their similarity across the variables.

  • Each cluster consists of data points that are more similar to each other than to those in other clusters.
  • It helps identify groupings in data without predefined labels.

Cluster Analysis

Applications

  • Market segmentation: A retail company might group customers into different groups based on buying behavior, demographics, or preferences to target marketing strategies accordingly.

  • Social Network Analysis: Cluster analysis can be used to detect groups of users with similar connections or interactions on social media platforms.

  • Recommender Systems: Streaming services like Netflix or Spotify use cluster analysis to group users with similar preferences, enabling the system to offer personalized recommendations.

  • Genetics and Bioinformatics: Cluster analysis is applied to genetic data to find groups of genes with similar expression patterns.

Example: 2D Data

2D data can be clustered manually when visualized on a scatter plot.

How many clusters can you spot?

Example: 2D Data

2D data can be clustered manually when visualized on a scatter plot.

How many clusters can you spot?

Example: Olive Oil Samples from Italy

Manually clustering high-dimensional data is more challenging. Tools like a tour or a scatter plot matrix can help visualize it. How many clusters can you spot?

library(langevitour)
library(tourr)
langevitour(scale(olive[, -c(1, 2)]), 
            pointSize = 2)

library(dplyr)    # %>% and mutate(); also used in later chunks
library(ggplot2)

olive[, -c(1, 2)] %>%
  scale() %>%
  as.data.frame() %>%
  GGally::ggscatmat(alpha = 0.1) +
  theme_light() +
  xlab("") +
  ylab("")

Example: Olive Oil Samples from Italy

Manually clustering high-dimensional data is more challenging. Tools like a tour or a scatter plot matrix can help visualize it. How many clusters can you spot?

library(langevitour)
library(tourr)
langevitour(olive[, -c(1, 2)] %>% scale(), 
            group = c("North", "South", "Sardinia")[olive$region],
            pointSize = 2,
            levelColors = RColorBrewer::brewer.pal(3, "Dark2"))

olive[, -c(1, 2)] %>%
  scale() %>%
  as.data.frame() %>%
  mutate(region = c("North", "South", "Sardinia")[olive$region]) %>%
  GGally::ggscatmat(alpha = 0.1, color = "region") +
  theme_light() +
  xlab("") +
  ylab("") +
  theme(legend.position = "none") +
  scale_color_brewer(palette = "Dark2")

We need clustering algorithms!

Distance Measure

Distance Measure

We want to group similar data points into clusters, but how do we determine if two data points are “similar” or “close”?

A common choice is Euclidean distance:

\[d(A, B) = \sqrt{\sum_{k=1}^{p}(x_{ak} - x_{bk})^2} = \sqrt{(\boldsymbol{x}_A - \boldsymbol{x}_B)^\top(\boldsymbol{x}_A - \boldsymbol{x}_B)}\] where \(\boldsymbol{x}_A\) and \(\boldsymbol{x}_B\) are length-\(p\) column vectors representing points A and B, respectively, and \(x_{ak}\) and \(x_{bk}\) denote the \(k\)-th elements of these vectors.

Examples

  • For point \(A\): \((0, 1)\) and point \(B\): \((2, 3)\), \(d(A, B) = \sqrt{(0 - 2)^2 + (1 - 3)^2} = 2\sqrt{2}.\)

  • For point \(A\): \((0, 1, 2)\) and point \(B\): \((2, 3, 4)\), \(d(A, B) = \sqrt{(0 - 2)^2 + (1 - 3)^2 + (2 - 4)^2} = 2\sqrt{3}.\)
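As a quick check, both examples can be reproduced in R (a minimal sketch):

```r
# Euclidean distance between two points a and b
euclid <- function(a, b) sqrt(sum((a - b)^2))

euclid(c(0, 1), c(2, 3))       # 2 * sqrt(2), about 2.828
euclid(c(0, 1, 2), c(2, 3, 4)) # 2 * sqrt(3), about 3.464
```

In practice, dist() computes this for all pairs of rows of a matrix at once.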

Distance Measure

Mahalanobis distance: \(\sqrt{(\boldsymbol{x}_A - \boldsymbol{x}_B)^\top S^{-1}(\boldsymbol{x}_A - \boldsymbol{x}_B)}\)

ggplot() +
  geom_point(data = mvtnorm::rmvnorm(10000, mean = c(0, 0), sigma = diag(c(4.17, 0.668))) %>% 
               as.data.frame(), 
             aes(V1, V2), alpha = 0.02) +
  ggforce::geom_ellipse(aes(x0 = 0, y0 = 0, a = 5, b = 2, angle = 0), linetype = 2) +
  annotate("point", x = 0, y = 0) +
  annotate("text", x = 0.5, y = 0.5, label = "A") +
  annotate("point", x = 5, y = 0) +
  annotate("text", x = 5.5, y = 0.5, label = "B") +
  annotate("point", x = 0, y = 2) +
  annotate("text", x = 0.5, y = 2.5, label = "C") +
  annotate("point", x = 2, y = 0) +
  annotate("text", x = 2.5, y = 0.5, label = "D") +
  coord_fixed() +
  theme_light() +
  scale_x_continuous(breaks = seq(-10, 10)) +
  ggtitle("D(A, B) = D(A, C) > D(A, D)") +
  xlab("X") +
  ylab("Y")
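The plot's claim can be verified numerically. stats::mahalanobis() returns the squared distance, so we take the square root; the covariance matrix below is the one used to simulate the point cloud above.

```r
S <- diag(c(4.17, 0.668))               # covariance from the simulation above
pts <- rbind(B = c(5, 0), C = c(0, 2), D = c(2, 0))
d <- sqrt(mahalanobis(pts, center = c(0, 0), cov = S))
d  # B and C are (nearly) equally far from A = (0, 0); D is much closer
```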

Manhattan distance: \(\sum_{k=1}^p|x_{ak} - x_{bk}|\)

ggplot() +
  annotate("point", x = 0, y = 0) +
  annotate("text", x = 0.1, y = 0.1, label = "A") +
  annotate("point", x = 0, y = 2) +
  annotate("text", x = 0.1, y = 2.1, label = "C") +
  annotate("point", x = 1, y = 1) +
  annotate("text", x = 1.1, y = 1.1, label = "B") +
  annotate("point", x = 1, y = 2) +
  annotate("text", x = 1.1, y = 2.1, label = "D") +
  annotate("segment", x = 0, y = 0, xend = 0, yend = 1, linetype = 2) +
  annotate("segment", x = 0, y = 1, xend = 1, yend = 1, linetype = 2) +
  annotate("segment", x = 0, y = 1, xend = 0, yend = 2, linetype = 2) +
  annotate("segment", x = 1, y = 1, xend = 1, yend = 2, linetype = 2) +
  coord_fixed() +
  theme_light() +
  scale_x_continuous(breaks = seq(0, 1)) +
  scale_y_continuous(breaks = seq(0, 2)) +
  ggtitle("D(A, D) = 3 > D(A, B) = D(A, C) = 2") +
  xlab("X") +
  ylab("Y")
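The distances in the title can be reproduced with dist(), which supports Manhattan distance directly:

```r
pts <- rbind(A = c(0, 0), B = c(1, 1), C = c(0, 2), D = c(1, 2))
m <- as.matrix(dist(pts, method = "manhattan"))
m["A", c("B", "C", "D")]  # 2, 2, 3
```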

Rules of Distance Metric

A distance metric must follow these conditions:

  1. \(D(A, A) = 0\)
  2. \(D(A, B) > 0\) for \(A \neq B\)
  3. \(D(A, B) = D(B, A)\)
  4. \(D(A, C) \leq D(A, B) + D(B, C)\)

Can you come up with your own distance metric?

Tip

  • Many divergences used in statistics, like the Kullback–Leibler divergence, do not qualify as metrics.

  • Euclidean distance is unsuitable for categorical variables, and you’ll explore alternative distance metrics in ETC3250/ETC5250.
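For instance, the Kullback–Leibler divergence between two discrete distributions fails rule 3 (symmetry), which is easy to check:

```r
# KL divergence between two discrete probability vectors
kl <- function(p, q) sum(p * log(p / q))

p <- c(0.5, 0.5)
q <- c(0.9, 0.1)
kl(p, q)  # about 0.51
kl(q, p)  # about 0.37 -- not symmetric, so not a metric
```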

K-means Clustering

K-means Clustering

K-means is an iterative clustering algorithm which requires specifying the number of clusters (\(k\)) in advance.

The goal is to form \(k\) clusters that minimize within-cluster variance, expressed as:

\[\underset{\boldsymbol{S}}{\text{arg }\text{min}}\sum_{i=1}^{k}\underbrace{\sum_{\boldsymbol{x} \in S_i}\frac{1}{|S_i|}\lVert \boldsymbol{x} - \boldsymbol{\mu}_i\rVert^2}_{\text{variance of cluster } i},\]

where \(\boldsymbol{S} = \{S_1, S_2, \dots, S_k\}\) represents the \(k\) clusters, \(|S_i|\) is the number of points in cluster \(i\), and \(\boldsymbol{\mu}_i\) is the mean of the \(i\)-th cluster.
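The objective can be written as a short R function (a sketch; note that the plots on the next slide use the sample variance var(), which divides by \(|S_i| - 1\) rather than \(|S_i|\)):

```r
# Within-cluster variance objective for a given cluster assignment
wc_variance <- function(X, cluster) {
  sum(sapply(split(as.data.frame(X), cluster), function(S) {
    mu <- colMeans(S)                             # cluster mean
    sum(sweep(as.matrix(S), 2, mu)^2) / nrow(S)   # variance of this cluster
  }))
}

X <- rbind(c(0, 0), c(0, 2), c(10, 0), c(10, 2))
wc_variance(X, c(1, 1, 2, 2))  # good assignment: 2
wc_variance(X, c(1, 2, 1, 2))  # bad assignment: 50
```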

K-means Clustering

Smaller within-cluster variance indicates that the data points within each cluster are more tightly grouped together.

set.seed(2024)
dat <- bind_rows(mvtnorm::rmvnorm(n = 100, mean = c(5, 5), sigma = matrix(c(1, -0.7, -0.7, 1), ncol = 2)) %>%
            as.data.frame(),
          mvtnorm::rmvnorm(n = 100, mean = c(10, 10), sigma = matrix(c(1, -0.5, -0.5, 1), ncol = 2)) %>%
            as.data.frame(), 
          .id = "set") %>%
  mutate(cluster = c("Cluster 1", "Cluster 2")[as.integer(V2/V1 > 1) + 1])

dat %>%
  ggplot() +
  geom_point(aes(V1, V2, col = cluster)) +
  theme_light() +
  xlab("X") +
  ylab("Y") +
  coord_fixed() +
  labs(col = "") +
  ggtitle(dat %>% 
            group_by(cluster) %>% 
            summarise(var = var(V1) + var(V2)) %>% 
            pull(var) %>% 
            c(sum(.)) %>%
            format(digits = 3) %>% 
            {glue::glue("Within-cluster variance: {.[1]} + {.[2]} = {.[3]}")})

set.seed(2024)
dat <- bind_rows(mvtnorm::rmvnorm(n = 100, mean = c(5, 5), sigma = matrix(c(1, -0.7, -0.7, 1), ncol = 2)) %>%
            as.data.frame(),
          mvtnorm::rmvnorm(n = 100, mean = c(10, 10), sigma = matrix(c(1, -0.5, -0.5, 1), ncol = 2)) %>%
            as.data.frame(), 
          .id = "set") %>%
  mutate(set = as.integer(set)) %>%
  mutate(cluster = c("Cluster 1", "Cluster 2")[set])

dat %>%
  ggplot() +
  geom_point(aes(V1, V2, col = cluster)) +
  theme_light() +
  xlab("X") +
  ylab("Y") +
  coord_fixed() +
  labs(col = "") +
  ggtitle(dat %>% 
            group_by(cluster) %>% 
            summarise(var = var(V1) + var(V2)) %>% 
            pull(var) %>% 
            c(sum(.)) %>%
            format(digits = 3) %>% 
            {glue::glue("Within-cluster variance: {.[1]} + {.[2]} = {.[3]}")})

NP-hard

It is not feasible to try every possible way of creating \(k\) clusters. How many ways are there to form \(k\) clusters?

\[S(n, k) = \frac{1}{k!}\sum_{i = 0}^{k}(-1)^{k-i}{k \choose i}i^{n}.\]

  • Approximately \(6.573841 \times 10^{67}\) ways for just 100 observations and 5 clusters!
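\(S(n, k)\) is the Stirling number of the second kind, and the count above can be reproduced directly:

```r
# Number of ways to partition n observations into k non-empty clusters
stirling2 <- function(n, k) {
  i <- 0:k
  sum((-1)^(k - i) * choose(k, i) * i^n) / factorial(k)
}

stirling2(4, 2)    # 7
stirling2(100, 5)  # about 6.57e67
```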

However, comparing two solutions is relatively straightforward (see previous slide).

Finding the optimal k-means solution is an example of an NP-hard problem:

  • A solution can be verified in polynomial time.

  • Finding the optimal solution may require an exponential amount of time.

  • NP-hard problems cannot be solved efficiently by any known algorithm.

  • We often rely on approximation methods to find good solutions.

Lloyd’s algorithm

  1. Randomly divide the data into \(k\) groups.
  2. Calculate the means for each of the \(k\) groups.
  3. Assign each observation \(x_i\) to the group with the nearest mean.
  4. Repeat steps 2 and 3 until no observations are reassigned.
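The four steps can be sketched in a few lines of base R. This is for illustration only: use kmeans() in practice, and note the sketch ignores the edge case of a cluster becoming empty.

```r
lloyd <- function(X, k, seed = 1) {
  set.seed(seed)
  cluster <- sample(rep(1:k, length.out = nrow(X)))   # 1. random groups
  repeat {
    mu <- apply(X, 2, tapply, cluster, mean)          # 2. group means
    d <- as.matrix(dist(rbind(mu, X)))[-(1:k), 1:k]   # distance to each mean
    nearest <- max.col(-d, ties.method = "first")     # 3. nearest mean
    if (all(nearest == cluster)) return(cluster)      # 4. stop if unchanged
    cluster <- nearest
  }
}

# The eight points from the example on the next slide
dat <- data.frame(x = c(8, 2, 7, 10, 5, 3, 1, 1),
                  y = c(2, 4, 2, 4, 5, 10, 10, 7))
lloyd(as.matrix(dat), k = 2)
```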

Example

Let’s try to cluster 8 data points into 2 groups using Lloyd’s algorithm.

obs x y
A 8 2
B 2 4
C 7 2
D 10 4
E 5 5
F 3 10
G 1 10
H 1 7

Example

Set \(k = 2\), and randomly divide the data into 2 groups.

obs x y cluster
A 8 2 1
B 2 4 2
C 7 2 2
D 10 4 2
E 5 5 1
F 3 10 2
G 1 10 1
H 1 7 1

Example

Compute the means for each group. \(\boldsymbol{\mu}_1 = (3.75, 6)\) and \(\boldsymbol{\mu}_2 = (5.5, 5)\).

obs x y cluster
A 8 2 1
B 2 4 2
C 7 2 2
D 10 4 2
E 5 5 1
F 3 10 2
G 1 10 1
H 1 7 1

Example

Find the nearest mean for each observation.

obs x y cluster d1 d2 near
A 8 2 1 5.84 3.91 2
B 2 4 2 2.66 3.64 1
C 7 2 2 5.15 3.35 2
D 10 4 2 6.56 4.61 2
E 5 5 1 1.60 0.50 2
F 3 10 2 4.07 5.59 1
G 1 10 1 4.85 6.73 1
H 1 7 1 2.93 4.92 1

Example

Update the memberships.

obs x y cluster d1 d2 near
A 8 2 2 5.84 3.91 2
B 2 4 1 2.66 3.64 1
C 7 2 2 5.15 3.35 2
D 10 4 2 6.56 4.61 2
E 5 5 2 1.60 0.50 2
F 3 10 1 4.07 5.59 1
G 1 10 1 4.85 6.73 1
H 1 7 1 2.93 4.92 1

Example

Recompute the means for each group. \(\boldsymbol{\mu}_1 = (1.75, 7.75)\) and \(\boldsymbol{\mu}_2 = (7.5, 3.25)\).

obs x y cluster
A 8 2 2
B 2 4 1
C 7 2 2
D 10 4 2
E 5 5 2
F 3 10 1
G 1 10 1
H 1 7 1

Example

Find the nearest mean for each observation. No observations need to be reassigned, so the algorithm has converged.

obs x y cluster d1 d2 near
A 8 2 2 8.49 1.35 2
B 2 4 1 3.76 5.55 1
C 7 2 2 7.79 1.35 2
D 10 4 2 9.06 2.61 2
E 5 5 2 4.26 3.05 2
F 3 10 1 2.57 8.11 1
G 1 10 1 2.37 9.37 1
H 1 7 1 1.06 7.50 1
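You can confirm this fixed point with kmeans(): supplying the converged means as starting centers, Lloyd's algorithm makes no further reassignments.

```r
dat <- data.frame(x = c(8, 2, 7, 10, 5, 3, 1, 1),
                  y = c(2, 4, 2, 4, 5, 10, 10, 7),
                  row.names = LETTERS[1:8])
centers <- rbind(c(1.75, 7.75), c(7.5, 3.25))  # converged means from above
result <- kmeans(dat, centers = centers, algorithm = "Lloyd")
result$cluster  # A, C, D, E -> cluster 2; B, F, G, H -> cluster 1
```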

Try K-means on olive

Although we know the data comes from three regions, we typically do not have this information beforehand for other datasets.

Tip

  • Experiment with different values of \(k\).

  • Evaluate the solutions using plots of the data or a two-way table.

  • Be sure to use set.seed() since the results can depend on the initialization.

  • Switch cluster labels to ensure the largest values appear on the diagonal of the two-way table.
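A minimal attempt, assuming olive is loaded from the tourr package as in the earlier slides:

```r
library(tourr)  # provides the olive data

set.seed(2024)
result <- kmeans(scale(olive[, -c(1, 2)]), centers = 3, nstart = 10)

# Two-way table of cluster labels against the known regions
table(cluster = result$cluster, region = olive$region)
```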

Check Group Means

Examine Results in 2D

Is it working well? Try using different variables for x and y.

Examine Results Using Tour

The algorithm tends to segment the data into roughly equal-sized, spherical clusters, so it works well when the clusters are well separated and roughly spherical in shape.

If gaps appear within a single cluster, it suggests that k-means has failed to capture important underlying cluster structure.

From Interactively exploring high-dimensional data and models in R (Di Cook and Ursula Laa).

set.seed(12345)
result <- kmeans(scale(olive[, -c(1, 2)]), 
                 centers = 3)
langevitour(olive[, -c(1, 2)] %>% scale(), 
            group = result$cluster,
            pointSize = 2,
            levelColors = RColorBrewer::brewer.pal(3, "Dark2"))

Choosing Clusters

There is no correct number of clusters. Choosing the number of clusters depends on the context.

Do not choose too many clusters:

A firm developing a different marketing strategy for each market segment may not have the resources to develop a large number of unique strategies.

Do not choose too few clusters:

If you choose the 1-cluster solution there is no point in doing clustering at all.

Cluster Statistics

The fpc package offers various cluster statistics that can assist you in comparing cluster solutions and determining the optimal number of clusters.

For more details, see ?cluster.stats().

Some commonly used cluster statistics

  • within.cluster.ss: Sum of squares within clusters. A lower value is better. This metric always decreases with more clusters, so focus on larger drops.
  • wb.ratio: The ratio of the average distance within clusters to the average distance between clusters. A lower value is preferable. This also decreases with more clusters, so choose when there are significant drops.
  • dunn: The minimum separation between clusters divided by the maximum cluster diameter. A higher value is better.
  • ch: The Calinski-Harabasz Index, which should also be high for a good clustering.
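A sketch of comparing several k-means solutions on the olive data with fpc::cluster.stats() (assumes the fpc and tourr packages are installed):

```r
library(fpc)
library(tourr)

X <- scale(olive[, -c(1, 2)])
d <- dist(X)

set.seed(2024)
stats <- sapply(2:5, function(k) {
  cl <- kmeans(X, centers = k, nstart = 10)$cluster
  s <- cluster.stats(d, cl)
  c(k = k, within.ss = s$within.cluster.ss, wb.ratio = s$wb.ratio,
    dunn = s$dunn, ch = s$ch)
})
round(t(stats), 2)  # one row per candidate k
```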

Let’s have a 15-minute break!

Hierarchical Clustering

Hierarchical Clustering

There are two types of hierarchical clustering:

  • Agglomerative: Start with each observation in its own cluster, and progressively merge them until all observations are grouped into a single cluster.
  • Divisive (not covered in this unit): Begin with all observations in one cluster, and sequentially split them until each observation is in its own cluster.

Hierarchical Clustering

Hierarchical clustering is a recursive algorithm:

  1. Identify the two closest data points in the dataset.
  2. Merge them to create a new “data point.”
  3. Repeat steps 1 and 2 until only one “data point” remains.

Distance between sets

We know how to calculate the distance between two individual points, but how do we compute the distance between a point and a set of points once they are merged?

For instance, how do we calculate the distance between point A and the set (B, C)? Or between the sets (A, C) and (B, D)?

Linkage

Linkage methods can be used to measure dissimilarity between sets of observations:

  • Single: The minimum distance between points in different clusters.
  • Complete: The maximum distance between points in different clusters.
  • Average: The average distance between points in different clusters.
  • Centroid: The distance between the centroids (averages) of different clusters.
  • Ward’s: Instead of merging the two closest clusters, merge the pair of clusters whose union gives the smallest increase in total within-cluster variance.

Linkage

- Images from ETC3250/ETC5250 lecture 9.

Example

Hierarchical clustering begins by computing the distance matrix for the dataset.

Select an appropriate distance metric at this stage. Here, we are using Euclidean distance.

A B C D E F G H
A 0.00 6.32 1.00 2.83 4.24 9.43 10.63 8.60
B 6.32 0.00 5.39 8.00 3.16 6.08 6.08 3.16
C 1.00 5.39 0.00 3.61 3.61 8.94 10.00 7.81
D 2.83 8.00 3.61 0.00 5.10 9.22 10.82 9.49
E 4.24 3.16 3.61 5.10 0.00 5.39 6.40 4.47
F 9.43 6.08 8.94 9.22 5.39 0.00 2.00 3.61
G 10.63 6.08 10.00 10.82 6.40 2.00 0.00 3.00
H 8.60 3.16 7.81 9.49 4.47 3.61 3.00 0.00

Example

Then choose a linkage and find the two closest data points. Here we use single linkage.

A B C D E F G H
A 0.00 6.32 1.00 2.83 4.24 9.43 10.63 8.60
B 6.32 0.00 5.39 8.00 3.16 6.08 6.08 3.16
C 1.00 5.39 0.00 3.61 3.61 8.94 10.00 7.81
D 2.83 8.00 3.61 0.00 5.10 9.22 10.82 9.49
E 4.24 3.16 3.61 5.10 0.00 5.39 6.40 4.47
F 9.43 6.08 8.94 9.22 5.39 0.00 2.00 3.61
G 10.63 6.08 10.00 10.82 6.40 2.00 0.00 3.00
H 8.60 3.16 7.81 9.49 4.47 3.61 3.00 0.00

Example

Update the distance matrix using single linkage.

(A,C) B D E F G H
(A,C) 0.00 5.39 2.83 3.61 8.94 10.00 7.81
B 5.39 0.00 8.00 3.16 6.08 6.08 3.16
D 2.83 8.00 0.00 5.10 9.22 10.82 9.49
E 3.61 3.16 5.10 0.00 5.39 6.40 4.47
F 8.94 6.08 9.22 5.39 0.00 2.00 3.61
G 10.00 6.08 10.82 6.40 2.00 0.00 3.00
H 7.81 3.16 9.49 4.47 3.61 3.00 0.00

Example

Find two closest data points.

(A,C) B D E F G H
(A,C) 0.00 5.39 2.83 3.61 8.94 10.00 7.81
B 5.39 0.00 8.00 3.16 6.08 6.08 3.16
D 2.83 8.00 0.00 5.10 9.22 10.82 9.49
E 3.61 3.16 5.10 0.00 5.39 6.40 4.47
F 8.94 6.08 9.22 5.39 0.00 2.00 3.61
G 10.00 6.08 10.82 6.40 2.00 0.00 3.00
H 7.81 3.16 9.49 4.47 3.61 3.00 0.00

Example

Update the distance matrix using single linkage.

(A,C) B D E (F,G) H
(A,C) 0.00 5.39 2.83 3.61 8.94 7.81
B 5.39 0.00 8.00 3.16 6.08 3.16
D 2.83 8.00 0.00 5.10 9.22 9.49
E 3.61 3.16 5.10 0.00 5.39 4.47
(F,G) 8.94 6.08 9.22 5.39 0.00 3.00
H 7.81 3.16 9.49 4.47 3.00 0.00
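In practice the whole sequence of merges is computed by hclust(). On the eight points from the worked example, the first merge heights match the matrices above: A and C at 1.00, F and G at 2.00, then (A, C) with D at 2.83.

```r
dat <- data.frame(x = c(8, 2, 7, 10, 5, 3, 1, 1),
                  y = c(2, 4, 2, 4, 5, 10, 10, 7),
                  row.names = LETTERS[1:8])
hc <- hclust(dist(dat), method = "single")
round(hc$height, 2)  # merge heights: 1, 2, 2.83, 3, 3.16, 3.16, 3.61
plot(hc)             # base-R dendrogram
```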

Different Linkage Results

Dendrogram

  • Each leaf of the dendrogram represents an individual observation.
  • Leaves merge into branches, and branches can merge with either leaves or other branches.
  • Merges that occur lower in the tree indicate that the groups of observations are merged earlier and are more similar to one another.

Important

We can cut the tree to partition the data into \(k\) clusters.

This allows us to obtain different clustering solutions from a single tree!

Cut the Tree Using Dendrogram
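With a fitted tree, cutree() returns the cluster memberships for any \(k\). A sketch using the single-linkage tree from the worked example:

```r
dat <- data.frame(x = c(8, 2, 7, 10, 5, 3, 1, 1),
                  y = c(2, 4, 2, 4, 5, 10, 10, 7),
                  row.names = LETTERS[1:8])
hc <- hclust(dist(dat), method = "single")

cutree(hc, k = 2)  # cuts below the final merge: {A, C, D} vs the rest
cutree(hc, k = 3)  # a finer partition from the same tree
```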

Stability of the Clustering Solution

One criterion for stability is that the number of clusters remains consistent over a wide range of tolerance levels.

In general, look for a long stretch of tolerance where the number of clusters does not change.

Tolerance level

  • Consider the axis representing distance as a measure of “tolerance level”.

  • When the distance between two clusters falls within this tolerance, they are merged into a single cluster.

  • As the tolerance increases, more clusters are merged, resulting in fewer overall clusters.

Try Hierarchical Clustering on olive

Examine Results Using Tour

Read this chapter about common patterns that can confuse clustering algorithms.

result <- dist(scale(olive[, -c(1, 2)])) %>%
  hclust(method = "ward.D2") 
langevitour(olive[, -c(1, 2)] %>% scale(), 
            group = cutree(result, 3),
            pointSize = 2,
            levelColors = RColorBrewer::brewer.pal(3, "Dark2"))

Conclusions

Hierarchical clustering requires four steps:

  1. Standardize the data if they are in different units (scale()).
  2. Compute distance matrix (dist()).
  3. Choose an agglomeration method like single linkage.
  4. Run cluster analysis (hclust()) and visualize it (ggdendrogram()).

Remember to profile the clusters and to provide insight into what these clusters may represent.